This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or ‘workflow’ can be divided into different components; we recommend reading more about them in the introduction notebook.
In this notebook specifically, we investigate the effect of varying the Normalization component on the outcome of the differential expression results. The two component variants are: median sweeping (on log2-transformed intensities) and CONSTANd (on untransformed intensities).
The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the ‘Code’ button. Each code section can be expanded in a similar fashion. You can also download the entire notebook source code.
library(stringi)
library(gridExtra)
library(dendextend)
library(kableExtra)
library(psych)
library(limma)
library(tidyverse)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
source('other_functions.R')
source('plotting_functions.R')
This notebook presents an isobaric labeling data analysis strategy that includes data-driven normalization.
In other notebooks in this series we have systematically varied components and observed how they affect the outcome of a DEA analysis. We have seen that median sweeping normalization does not work well for intensities on the original scale, and that CONSTANd does not work well on log2-transformed intensities. Here, we compare median sweeping on the log2 scale, which we know does a good job, with CONSTANd on the original intensity scale.
Let’s load our PSM-level data set:
data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
tmp <- dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p > 0 & params$subsample_p == floor(params$subsample_p) & params$subsample_p <= length(tmp)) {
  sub.prot <- tmp[sample(1:length(tmp), size = params$subsample_p)]
  if (length(spiked.proteins) > 0) sub.prot <- c(sub.prot, spiked.proteins)
  dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}
We store the metadata in sample.info and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run.
display_dataframe_head(sample.info)
| Run | Run.short | Channel | Sample | Sample.short | Condition | Color |
|---|---|---|---|---|---|---|
| Mixture1_1 | Mix1_1 | 127C | Mixture1_1:127C | Mix1_1:127C | 0.125 | black |
| Mixture1_1 | Mix1_1 | 127N | Mixture1_1:127N | Mix1_1:127N | 0.667 | green |
| Mixture1_1 | Mix1_1 | 128C | Mixture1_1:128C | Mix1_1:128C | 1 | red |
| Mixture1_1 | Mix1_1 | 128N | Mixture1_1:128N | Mix1_1:128N | 0.5 | blue |
| Mixture1_1 | Mix1_1 | 129C | Mixture1_1:129C | Mix1_1:129C | 0.5 | blue |
| Mixture1_1 | Mix1_1 | 129N | Mixture1_1:129N | Mix1_1:129N | 0.125 | black |
referenceCondition
## [1] "0.5"
channelNames
## [1] "127C" "127N" "128C" "128N" "129C" "129N" "130C" "130N"
First, we choose appropriate unit scales for use with median sweeping and CONSTANd.
dat.unit.l <- emptyList(variant.names)
dat.unit.l$median_sweeping <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)
dat.unit.l$CONSTANd <- dat.l %>% rename(response=intensity)
In the following two subsections, we apply CONSTANd and median sweeping normalization in parallel to compare them later on.
Since both methods need to be applied to matrix-like data, let’s switch to wide format. (Actually, this is semi-wide, since the Channel columns still have contributions from all Runs, but that’s OK because in the next step we split by Run.)
# switch to wide format
dat.unit.w <- lapply(dat.unit.l, function(x) {
pivot_wider(data = x, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)
})
dat.norm.w <- emptyList(names(dat.unit.w))
Median sweeping means subtracting from each PSM quantification value both the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). First, let’s sweep the medians of all the rows, and do the columns later, as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g). No need to split this per Run, because each row in this semi-wide format contains only values from one Run, and each median calculation is independent of the other rows.
# subtract the spectrum median log2intensity from the observed log2intensities
dat.norm.w$median_sweeping <- dat.unit.w$median_sweeping
dat.norm.w$median_sweeping[,channelNames] <- dat.norm.w$median_sweeping[,channelNames] %>% sweep(1, apply(.[,channelNames], 1, median, na.rm=T))
display_dataframe_head(dat.norm.w$median_sweeping[, channelNames])
| 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|
| -0.2088531 | -0.4106403 | -0.3641628 | 0.1561868 | -0.0627039 | 0.2413401 | 0.0831727 | 0.0627039 |
| -0.1321964 | -0.3476832 | -0.2204273 | 0.1618711 | -0.0477788 | 0.0891915 | 0.0477788 | 0.0816737 |
| 0.4734048 | -0.5087304 | -0.7217574 | 0.4347442 | 0.0890943 | 0.8964941 | -0.0890943 | -0.4143805 |
| -0.3322250 | -0.3318356 | -0.1900625 | 0.5083959 | 0.0187098 | 0.3555902 | 0.0150596 | -0.0150596 |
| -0.0199852 | -0.3901617 | -0.2237788 | 0.2481239 | 0.0199852 | 0.4575068 | 0.0399986 | -0.0470807 |
| -0.3124028 | -0.4701170 | -0.0407846 | 0.2844554 | -0.0352769 | 0.3817914 | 0.0352769 | 0.1321293 |
CONSTANd (Van Houtven et al.) normalizes a data matrix of untransformed intensities by iteratively ‘raking’ along the rows and columns (i.e., multiplying each row or column by a particular number) such that the row means and column means equal 1. One can never attain both sets of constraints exactly at the same time, but the algorithm converges very quickly to the desired precision.
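The raking idea can be sketched in a few lines of plain R. This is a simplified illustration only, not the CONSTANd package itself; the hypothetical `rake_sketch()` function omits the package's refinements (e.g., its exact convergence criterion):

```r
# Simplified illustration of iterative raking (NOT the CONSTANd package itself):
# alternately rescale rows and columns until all row and column means are ~1.
rake_sketch <- function(m, max_iter = 50, tol = 1e-6) {
  for (i in seq_len(max_iter)) {
    m <- m / rowMeans(m, na.rm = TRUE)                  # set row means to 1
    m <- sweep(m, 2, colMeans(m, na.rm = TRUE), `/`)    # set column means to 1
    # the column step perturbs the row means again; stop once they stay near 1
    if (max(abs(rowMeans(m, na.rm = TRUE) - 1)) < tol) break
  }
  m
}

m <- matrix(c(10, 20, 40, 80,
              15, 30, 60, 120), nrow = 2, byrow = TRUE)
round(colMeans(rake_sketch(m)), 6)  # all (close to) 1
```

In practice one should of course use the `CONSTANd()` function from the package, as done below.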
Now let’s apply CONSTANd to each Run separately, and then combine the results into a semi-wide dataframe again.
# dat.unit.l entries are in long format so all have same colnames and no channelNames
x.split <- split(dat.unit.w$CONSTANd, dat.unit.w$CONSTANd$Run) # apply CONSTANd to each Run separately
x.split.norm <- lapply(x.split, function(y) {
y[,channelNames] <- CONSTANd(y[,channelNames])$normalized_data
return(y)
})
dat.norm.w$CONSTANd <- bind_rows(x.split.norm)
display_dataframe_head(dat.norm.w$CONSTANd[, channelNames])
| 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|
| 1.0621950 | 1.1178884 | 0.7538718 | 0.9803849 | 1.1324814 | 1.102989 | 0.8988097 | 0.9513802 |
| 0.9735801 | 0.9715923 | 1.0164174 | 1.1014593 | 0.8773265 | 0.986439 | 0.9902503 | 1.0829352 |
| 0.9345140 | 1.0451980 | 1.0188749 | 0.9352056 | 1.0150034 | 1.085620 | 0.9559529 | 1.0096309 |
| 1.0296268 | 0.9760853 | 0.9194507 | 1.0114029 | 1.0049258 | 1.002021 | 1.0004727 | 1.0560147 |
| 1.0072346 | 1.1000926 | 0.9995079 | 0.9500389 | 0.9420141 | 0.819079 | 1.0837106 | 1.0983224 |
| 0.9124313 | 0.9813590 | 1.0383122 | 0.9840386 | 0.9260962 | 1.001267 | 1.0613825 | 1.0951135 |
Within each Run and within each Channel, we replace multiple related observations with their median. First, for each Peptide (median of the PSM values), then for each Protein (median of the peptide values).
# group by (run,)protein,peptide then summarize twice (once on each level)
dat.norm.summ.w <- lapply(dat.norm.w, function(x){
y <- x %>% group_by(Run, Protein, Peptide) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% ungroup()
return(y)})
Notice that the row sums are no longer equal to the number of channels (as they were right after CONSTANd), because median summarization does not preserve them (mean summarization would).
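A quick numeric check makes this concrete (hypothetical values; each row mean is 1, as after CONSTANd):

```r
# Three PSMs of one peptide across three channels; every row mean equals 1.
m <- rbind(c(0.5, 1.5, 1.0),
           c(1.2, 0.9, 0.9),
           c(1.3, 0.6, 1.1))
sum(colMeans(m))           # mean summarization: summary row still sums to 3 (= Ncols)
sum(apply(m, 2, median))   # median summarization: 1.2 + 0.9 + 1.0 = 3.1, not 3
```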
Median sweeping needs two passes, so now that the data is on the protein level, let’s sweep all values separately per protein in the columns/samples. This is slightly different from sweeping before the summarization step because the median of medians is not the same as the grand median, but this does not introduce any bias.
# median sweeping: in each channel, subtract median computed across all proteins within the channel
# do the above separately for each MS run
x.split <- split(dat.norm.summ.w$median_sweeping, dat.norm.summ.w$median_sweeping$Run)
x.split.norm <- lapply( x.split, function(y) {
y[,channelNames] <- sweep(y[,channelNames], 2, apply(y[,channelNames], 2, median, na.rm=T) )
return(y) } )
dat.norm.summ.w$median_sweeping <- bind_rows(x.split.norm)
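The median-of-medians point is easy to verify on a toy vector (hypothetical numbers):

```r
# The median of group medians need not equal the grand median.
x <- c(1, 2, 3, 10, 11, 12, 13)   # e.g., 7 PSM values of one protein
g <- c(1, 1, 1, 2, 2, 2, 2)       # split into two peptide groups
median(x)                         # grand median: 10
median(tapply(x, g, median))      # median of group medians: (2 + 11.5) / 2 = 6.75
```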
Before getting to the DEA section, let’s do some basic quality control and take a sneak peek at the differences between the component variants we’ve chosen. First, however, we should make the data completely wide, so that each sample gets its own unique column.
# make data completely wide (also across runs)
dat.norm.summ.w2 <- lapply(dat.norm.summ.w, function(x) x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}") )
These boxplots of the normalized intensities show that both variants’ distributions are symmetrical, without notable differences.
# use (half-)wide format
par(mfrow=c(1,2))
for (i in seq_along(variant.names)) {
boxplot_w(dat.norm.summ.w[[variant.names[i]]], sample.info, paste('normalized', variant.names[i], sep='_'))}
We then make MA plots of two single samples taken from condition 0.5 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively).
There are no notable differences between CONSTANd and median sweeping.
# use wide2 format
p <- emptyList(variant.names)
for (i in 1: n.comp.variants){
p[[i]] <- maplot_ils(dat.norm.summ.w2[[i]], ma.onesample.num, ma.onesample.denom, scale.vec[i], paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(p[[1]], p[[2]], ncol=2)
To increase the robustness of these results, let’s make some more MA plots, but now for all samples from condition 0.5 and condition 0.125 (quantification values averaged within condition).
Again, there are no notable differences between CONSTANd and median sweeping.
channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% select(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% select(Sample) %>% pull
p <- emptyList(variant.names)
for (i in 1:n.comp.variants){
p[[i]] <- maplot_ils(dat.norm.summ.w2[[i]], channels.num, channels.denom, scale=scale.vec[i], paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(p[[1]], p[[2]], ncol=2)
Now, let’s check whether these multi-dimensional data contain some kind of grouping; it’s time to make PCA plots.
Even though PC1 does seem to capture the conditions, providing a gradient for the dilution number, only the 0.125 condition is completely separable in the normalized data. There are no large differences between CONSTANd and median sweeping, though CONSTANd-normalized samples seem arguably slightly more separable according to condition.
par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins. There are no notable differences between CONSTANd and median sweeping. In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.
The PCA plots of all proteins have a rather low fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all (except for condition 0.125 after either normalization approach). This is not surprising, as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).
par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
dendrogram_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. Below are the distributions of p-values from a linear model for the response variable with Run as a covariate. If the run effect was removed successfully, these p-values ought to be large.
The run effect which was present in the raw data is only partially removed by median sweeping, but almost entirely removed by CONSTANd.
dat <- list(dat.norm.summ.l$median_sweeping, dat.norm.summ.l$CONSTANd)
names(dat) <- c('median_sweeping','CONSTANd')
run_effect_plot(dat)
We look at the log2 fold changes of each condition w.r.t. the reference condition with dilution factor 0.5. Since we are working on a log2 scale already, this means that for each protein we just look at the difference in mean observation across all channels between one condition and the reference condition. Note that this is not the same as looking at the log2 of the ratio of mean raw intensities for each condition, nor at the log2 of the mean ratio of raw intensities for each condition, since \(mean(log_2 (B)) - mean(log_2 (A)) \neq log_2 (\frac{mean(B)}{mean(A)}) \neq log_2 (mean(\frac{B}{A}))\).
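The three quantities indeed differ in general, as a quick check on hypothetical intensities shows:

```r
# Hypothetical raw intensities of one protein in the reference (A) and a test (B) condition
A <- c(100, 200, 400)
B <- c(150, 350, 900)
mean(log2(B)) - mean(log2(A))  # difference of mean log2 intensities (what we report)
log2(mean(B) / mean(A))        # log2 of the ratio of means
log2(mean(B / A))              # log2 of the mean ratio
# -> three different values
```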
To check whether these fold changes are significant (criterion: \(q<0.05\)), we use a moderated t-test slightly adapted from the limma package, which in use cases like ours should improve statistical power over a regular t-test. In a nutshell, this is a t-test done independently for each protein, except that the variance used in calculating the t-statistic is moderated using empirical Bayes estimation. After testing, we correct for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
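For reference, the standard limma flow that our `moderated_ttest()` helper adapts looks roughly like this (a self-contained toy example on simulated data, not the notebook’s actual analysis):

```r
library(limma)

set.seed(1)
# Toy data: 20 proteins x 4 samples, two conditions (simulated, for illustration only)
m <- matrix(rnorm(80), nrow = 20,
            dimnames = list(paste0("prot", 1:20), paste0("s", 1:4)))
design <- cbind(Intercept = 1, Condition = c(0, 0, 1, 1))

fit <- eBayes(lmFit(m, design))   # moderated (empirical Bayes) variance estimates
# log2 fold changes with BH-adjusted p-values (q-values):
tt <- topTable(fit, coef = "Condition", number = Inf, adjust.method = "BH")
head(tt[, c("logFC", "P.Value", "adj.P.Val")])
```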
# design matrix as used in ANOVA testing.
design.matrix <- get_design_matrix(referenceCondition, sample.info)
dat.dea <- emptyList(names(dat.norm.summ.w2))
for (i in seq_along(dat.norm.summ.w2)) {
this_scale <- scale.vec[match(names(dat.dea)[i], variant.names)]
d <- column_to_rownames(as.data.frame(dat.norm.summ.w2[[variant.names[i]]]), 'Protein')
dat.dea[[variant.names[i]]] <- moderated_ttest(dat=d, design.matrix, scale=this_scale)}
Now, the most important part: let’s find out how our component variants have affected the outcome of the DEA.
A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.
All in all, both methods give rise to comparable confusion matrices. They differ slightly from one condition comparison to another, but whenever either method is more sensitive it is also less specific.
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
| | background (median_sweeping) | spiked (median_sweeping) | background (CONSTANd) | spiked (CONSTANd) |
|---|---|---|---|---|
| not_DE | 4061 | 4 | 4062 | 5 |
| DE | 3 | 15 | 2 | 14 |
| | median_sweeping | CONSTANd |
|---|---|---|
| Accuracy | 0.998 | 0.998 |
| Sensitivity | 0.789 | 0.737 |
| Specificity | 0.999 | 1.000 |
| PPV | 0.833 | 0.875 |
| NPV | 0.999 | 0.999 |
| | background (median_sweeping) | spiked (median_sweeping) | background (CONSTANd) | spiked (CONSTANd) |
|---|---|---|---|---|
| not_DE | 4064 | 18 | 4063 | 15 |
| DE | 0 | 1 | 1 | 4 |
| | median_sweeping | CONSTANd |
|---|---|---|
| Accuracy | 0.996 | 0.996 |
| Sensitivity | 0.053 | 0.211 |
| Specificity | 1.000 | 1.000 |
| PPV | 1.000 | 0.800 |
| NPV | 0.996 | 0.996 |
| | background (median_sweeping) | spiked (median_sweeping) | background (CONSTANd) | spiked (CONSTANd) |
|---|---|---|---|---|
| not_DE | 4060 | 5 | 4055 | 3 |
| DE | 4 | 14 | 9 | 16 |
| | median_sweeping | CONSTANd |
|---|---|---|
| Accuracy | 0.998 | 0.997 |
| Sensitivity | 0.737 | 0.842 |
| Specificity | 0.999 | 0.998 |
| PPV | 0.778 | 0.640 |
| NPV | 0.999 | 0.999 |
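As a sanity check, these metrics can be recomputed by hand from the confusion matrices; e.g., for median sweeping in the first comparison above, TP = 15, FN = 4, FP = 3, TN = 4061 (the hypothetical `conf_metrics()` helper below just applies the standard definitions):

```r
# Standard definitions of the five metrics, applied to one 2x2 confusion matrix
conf_metrics <- function(TP, FP, TN, FN) {
  c(accuracy    = (TP + TN) / (TP + FP + TN + FN),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    PPV         = TP / (TP + FP),
    NPV         = TN / (TN + FN))
}
round(conf_metrics(TP = 15, FP = 3, TN = 4061, FN = 4), 3)
# matches the reported 0.998 / 0.789 / 0.999 / 0.833 / 0.999
```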
To see whether the Normalization methods produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case).
Though there are differences between the methods, for all conditions the q-values are quite well-correlated (\(>0.864\)) and the fold changes extremely well-correlated (\(>0.986\)).
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.
Notice also that the observed fold changes are rather low in absolute value.
for (i in 1:n.contrasts){
volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}
A good way to assess the general trend of the fold change estimates on a more ‘macroscopic’ scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.
Clearly, the empirical results tend towards the theoretical truth, but not a single spike-in protein attained its expected fold change. There is a clear, strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.
Both CONSTANd and median sweeping give rise to very similar distributions.
# plot theoretical value (horizontal lines) and violin per variant
violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition)
When applying CONSTANd and median sweeping normalization on the unit scales they are meant for (untransformed and \(log_2\)-transformed intensities, respectively), they perform very comparably. The only notable difference was that CONSTANd completely removed the run effect, while median sweeping did so only partially.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] CONSTANd_0.99.5 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
## [5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4
## [9] ggplot2_3.3.2 tidyverse_1.3.0 limma_3.45.19 psych_2.0.9
## [13] kableExtra_1.3.1 dendextend_1.14.0 gridExtra_2.3 stringi_1.5.3
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-150 fs_1.5.0 lubridate_1.7.9
## [4] webshot_0.5.2 httr_1.4.2 tools_4.0.3
## [7] backports_1.1.10 R6_2.4.1 rpart_4.1-15
## [10] DBI_1.1.0 mgcv_1.8-33 colorspace_1.4-1
## [13] nnet_7.3-14 withr_2.3.0 tidyselect_1.1.0
## [16] mnormt_2.0.2 compiler_4.0.3 cli_2.1.0
## [19] rvest_0.3.6 xml2_1.3.2 labeling_0.4.2
## [22] scales_1.1.1 digest_0.6.27 rmarkdown_2.5
## [25] pkgconfig_2.0.3 htmltools_0.5.0 dbplyr_1.4.4
## [28] highr_0.8 rlang_0.4.8 readxl_1.3.1
## [31] rstudioapi_0.11 farver_2.0.3 generics_0.0.2
## [34] jsonlite_1.7.1 ModelMetrics_1.2.2.2 magrittr_1.5
## [37] Matrix_1.2-18 Rcpp_1.0.5 munsell_0.5.0
## [40] fansi_0.4.1 viridis_0.5.1 lifecycle_0.2.0
## [43] pROC_1.16.2 yaml_2.2.1 MASS_7.3-53
## [46] plyr_1.8.6 recipes_0.1.14 grid_4.0.3
## [49] blob_1.2.1 parallel_4.0.3 crayon_1.3.4
## [52] lattice_0.20-41 haven_2.3.1 splines_4.0.3
## [55] hms_0.5.3 tmvnsim_1.0-2 knitr_1.30
## [58] pillar_1.4.6 stats4_4.0.3 reshape2_1.4.4
## [61] codetools_0.2-16 reprex_0.3.0 glue_1.4.2
## [64] evaluate_0.14 data.table_1.13.2 modelr_0.1.8
## [67] vctrs_0.3.4 foreach_1.5.1 cellranger_1.1.0
## [70] gtable_0.3.0 assertthat_0.2.1 xfun_0.18
## [73] gower_0.2.2 prodlim_2019.11.13 broom_0.7.2
## [76] e1071_1.7-4 survival_3.2-7 class_7.3-17
## [79] viridisLite_0.3.0 timeDate_3043.102 iterators_1.0.13
## [82] lava_1.6.8 ellipsis_0.3.1 caret_6.0-86
## [85] ipred_0.9-9